THE OFFICE

TEXT ANALYSIS


This project’s main purpose is to analyze a TV show in a reliable and measurable way, without the need to watch the whole show or rely on a personal perspective. The selected subject for this analysis is the sitcom ‘The Office’, which was selected mainly for the high availability of data.

This notebook uses the previously collected and cleaned data to go through the analysis process.

Introduction

Purpose

To analyze a TV show in a reliable and measurable way, without the need to watch the whole show or rely on a personal perspective.

Questions

The main questions this analysis will try to answer are described below:

Who are the main characters?
This is a descriptive question: the analysis will use simple statistics to identify the most relevant characters.

How do they communicate? This is an explanatory question: we'll combine descriptive statistics with natural language processing methods to explain the relations between the characters, the polarity of their dialogs, and which words and terms could be used to describe them.

Selected subject - The Office (US)

The show selected for this project is The Office, a sitcom from NBC.
The series aired from 2005 to 2013 and is still one of the most-watched shows on Netflix.


IMDb describes the show as:

“A mockumentary on a group of typical office workers, where the workday consists of ego clashes, inappropriate behavior, and tedium.”

The criteria for selecting this TV show are:

  • Complete and Reliable Data:
    The Office has 9 seasons and about 201 episodes, which should provide the project with enough dialogs to analyze;
  • Tested by the Audience:
    The Office has been on TV and digital media for almost 15 years, so a large number of reviews and ratings are available online.

Data Collection

The data collection process used two scripts. Their main functionalities are listed below:

The_Office_Scraper.py

  • Navigates transcripts.foreverdreaming.org;
  • Collects the URLs to where the transcripts are stored;
  • Navigates each of the collected URLs;
  • Collects the dialogs in the transcripts;
  • Creates a CSV file to store the collected data;

    The CSV contains the fields:

  • ID number (Unique);
  • Dialog – The text containing what a character said;
  • Character name – The name of the character that said the dialog;
  • Episode name – The episode name as displayed on the website;  

IMDB_Scraper.py

  • Navigates imdb.com;
  • Collects the current rating listed for each episode;
  • Creates a CSV file to store the collected data;

    The CSV file contains the fields:

  • ID number (Unique);
  • Rating – The average ratings given by the viewers of the episode;
  • Episode name – The title of the episode;
  • Episode number – The episode number, unique to the season;
  • Season number – The season of the episode;

Requirements

The project was built with Python 3.7 and uses a mix of Python scripts and Jupyter notebooks.


In the previous notebooks we mostly used pandas, NumPy, BeautifulSoup, NLTK, and VADER to collect, clean, and prepare the data.

The libraries used in this notebook are:

pandas

for data wrangling;

NumPy, scikit-learn, SciPy, and spaCy

for mathematical, statistical, and machine-learning tasks;

NetworkX, Matplotlib, and seaborn

for visualizations.

Main dataset, first 5 rows:
Out[5]:
id text name episode_name episode_number season ep_seas clean_txt sentences words sentences_qty words_qty negative neutral positive compound
0 2 All right Jim. Your quarterlies look very goo... Michael PILOT 01 1 01-01 right jim quarterlies look good things library [' All right Jim.', 'Your quarterlies look ver... ['all', 'right', 'jim', 'your', 'quarterlies',... 3 14 0.0 0.803 0.197 0.4927
1 3 Oh, I told you. I couldn't close it. So... Jim PILOT 01 1 01-01 oh told couldnt close [' Oh, I told you.', "I couldn't close it.", '... ['oh', 'i', 'told', 'you', 'i', 'couldnt', 'cl... 3 9 0.0 1.000 0.000 0.0000
2 4 So you've come to the master for guidance? Is... Michael PILOT 01 1 01-01 youve come master guidance youre saying grassh... [" So you've come to the master for guidance?"... ['so', 'youve', 'come', 'to', 'the', 'master',... 2 14 0.0 1.000 0.000 0.0000
3 5 Actually, you called me in here, but yeah. Jim PILOT 01 1 01-01 actually called yeah [' Actually, you called me in here, but yeah.'] ['actually', 'you', 'called', 'me', 'in', 'her... 1 8 0.0 0.714 0.286 0.4215
4 6 All right. Well, let me show you how it's done. Michael PILOT 01 1 01-01 right well let show done [' All right.', "Well, let me show you how it'... ['all', 'right', 'well', 'let', 'me', 'show', ... 2 10 0.0 0.811 0.189 0.2732

Main Characters

The first step in analyzing the show is to define who the main characters are.

This is open to interpretation, so for this project we rely on a few metrics:

  • The total number of dialogs a character had;
  • The total number of episodes in which the character had a dialog;
  • The number of seasons the character appeared in.

The number of dialogs and episodes is the main indicator: characters who have the largest share of the dialog and appear in most episodes receive more attention and therefore should be the main characters.

The challenge here is special guests and characters who were very important for a short time. Those characters have many dialogs and take part in many episodes, but they are around for a couple of seasons at most. This is why we also consider the number of seasons when calculating the main-character score.

To solve this issue I developed a score that combines all of the metrics above.

Facts


Group by character

We start by aggregating the numerical fields and computing their descriptive statistics, such as means, standard deviations, medians, and other aggregations. This will help us calculate the scores and will also serve as the basis for all the other analyses comparing the characters.
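
As a rough illustration of this aggregation step, the sketch below groups a few dialog rows by character using only the standard library. The rows mirror the first rows of the main dataset shown earlier. The helper name `group_by_character` and the selection of statistics are assumptions made for this example, and the notebook itself most likely relies on pandas for the same task.

```python
from collections import defaultdict
from statistics import mean, median

# (name, season, ep_seas, words_qty) for the first rows of the main dataset.
rows = [
    ("Michael", 1, "01-01", 14),
    ("Jim", 1, "01-01", 9),
    ("Michael", 1, "01-01", 14),
    ("Jim", 1, "01-01", 8),
    ("Michael", 1, "01-01", 10),
]


def group_by_character(rows):
    """Aggregate dialog rows into per-character descriptive statistics."""
    grouped = defaultdict(list)
    for name, season, ep_seas, words_qty in rows:
        grouped[name].append((season, ep_seas, words_qty))

    stats = {}
    for name, items in grouped.items():
        words = [w for _, _, w in items]
        stats[name] = {
            "dialogs": len(items),                       # number of dialogs
            "avg_words": mean(words),                    # mean words per dialog
            "median_words": median(words),               # median words per dialog
            "unique_s": len({s for s, _, _ in items}),   # seasons with a dialog
            "unique_ep": len({e for _, e, _ in items}),  # episodes with a dialog
        }
    return stats


stats = group_by_character(rows)
print(stats["Michael"])
```

The `unique_ep` and `unique_s` counts are the inputs the score needs, while the word statistics feed the later comparisons between characters.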

Score

This is the indicator I developed to help find the main characters of the series and rank them by relevance to the show.

Score = nep + (nd / nep) * (ns / 5)

nep = number of episodes;
nd = number of dialogs;
ns = number of seasons;

5 is a threshold I chose.
The idea is to "penalize" characters that appeared in fewer than 5 seasons (approx. half the series) and give more weight to characters that appeared in more than 5 seasons.
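
The formula can be written as a small function. This is a minimal sketch: the function and parameter names are chosen for this example, and the inputs are the episode, dialog, and season counts reported for Michael and Dwight in the aggregated results.

```python
def main_character_score(n_episodes, n_dialogs, n_seasons, threshold=5):
    """Score = nep + (nd / nep) * (ns / threshold)."""
    return n_episodes + (n_dialogs / n_episodes) * (n_seasons / threshold)


# Counts reported for Michael and Dwight in the aggregated results.
michael = main_character_score(n_episodes=139, n_dialogs=10960, n_seasons=8)
dwight = main_character_score(n_episodes=188, n_dialogs=6852, n_seasons=9)
print(round(michael, 6), round(dwight, 6))
```

The second term is the average number of dialogs per episode, scaled up or down by how many seasons the character appeared in relative to the threshold. This is why Michael still ranks first with fewer episodes than Dwight.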

Out[8]:
chars dialogs avg_words std_words 25%_median_words 50%_median_words 75%_median_words avg_sentences std_sentences positive neutral negative compound total_words unique_s unique_ep score
184 Michael 10960.0 13.585675 16.290656 4.0 8.0 17.0 2.343704 2.162617 0.193529 0.731349 0.075123 0.146149 148899 8 139 265.158273
81 Dwight 6852.0 11.050058 12.467607 3.0 7.0 14.0 1.972709 1.624145 0.155236 0.762325 0.082144 0.083521 75715 9 188 253.604255
131 Jim 6314.0 9.225055 10.859843 3.0 6.0 12.0 1.649034 1.195340 0.195979 0.738376 0.065648 0.135329 58247 9 187 247.776471
205 Pam 5035.0 8.962860 10.792735 3.0 6.0 11.0 1.598808 1.242186 0.198712 0.734246 0.067045 0.129266 45128 9 184 233.255435
153 Kevin 1567.0 8.002553 8.765353 2.0 5.0 10.0 1.560306 1.068548 0.179865 0.744870 0.075264 0.094385 12540 9 182 197.497802
12 Angela 1564.0 8.598465 9.622824 3.0 6.0 11.0 1.615729 1.263710 0.153496 0.739256 0.106611 0.053237 13448 9 173 189.272832
10 Andy 3780.0 11.747884 12.704352 3.0 8.0 16.0 1.981746 1.574955 0.176831 0.752245 0.070660 0.135988 44407 7 145 181.496552
204 Oscar 1366.0 8.791362 9.039496 3.0 6.0 12.0 1.560761 1.063706 0.152761 0.784579 0.062661 0.080598 12009 9 166 180.812048
216 Phyllis 976.0 7.695697 7.513722 3.0 6.0 10.0 1.376025 0.721110 0.159429 0.768445 0.072122 0.085728 7511 9 168 178.457143
251 Stanley 686.0 8.651603 8.996003 3.0 6.0 11.0 1.447522 0.960239 0.110771 0.801340 0.087892 0.026518 5935 9 168 175.350000
238 Ryan Howard 1212.0 9.922442 10.506602 3.0 6.0 13.0 1.623762 1.185668 0.172592 0.764349 0.063062 0.120825 12026 9 142 157.363380
148 Kelly 848.0 10.920991 11.498150 4.0 8.0 13.0 1.678066 1.195778 0.149959 0.771801 0.078242 0.094249 9261 9 144 154.600000
182 Meredith 562.0 7.798932 6.972096 3.0 6.0 11.0 1.626335 1.199770 0.164235 0.741536 0.094222 0.049518 4383 9 134 141.549254
58 Creed 408.0 9.593137 10.038312 3.0 6.0 11.0 1.784314 1.418567 0.142757 0.779772 0.075025 0.077308 3914 8 130 135.021538
67 Darryl 1234.0 9.528363 9.806295 3.0 6.0 12.0 1.719611 1.170392 0.161492 0.766948 0.071566 0.093145 11758 9 111 131.010811

Interpretation

With help from theoffice.fandom.com, the wiki for the series, we can outline some information about the characters and compare each one with the other characters that had a similar score.

Michael is the manager of the office and, according to our score, the lead character of the show. He is followed by Dwight, who is the 'Assistant to the Regional Manager' for most of the show.

The list follows with:

  • Jim and Pam, who play a romantic couple in the show.

  • Kevin, Angela, and Oscar (who ranks just after Andy), who all work in accounting and sit close to each other.

  • Andy, whose score falls among those of the accountants, is a salesman who joined later in the series.

  • Phyllis and Stanley, who are both salespeople and sit in front of each other.

  • Ryan and Kelly, who also play a romantic couple in the show.

  • Meredith and Creed, who have a more cartoonish aspect to their characters.

  • Darryl, who is in the show from the beginning but doesn't appear much in the earlier seasons, since he works below the office in the warehouse.

Out[11]:
chars unique_ep dialogs unique_s score
184 Michael 139 10960.0 8 265.158273
81 Dwight 188 6852.0 9 253.604255
131 Jim 187 6314.0 9 247.776471
205 Pam 184 5035.0 9 233.255435
153 Kevin 182 1567.0 9 197.497802
12 Angela 173 1564.0 9 189.272832
10 Andy 145 3780.0 7 181.496552
204 Oscar 166 1366.0 9 180.812048
216 Phyllis 168 976.0 9 178.457143
251 Stanley 168 686.0 9 175.350000
238 Ryan Howard 142 1212.0 9 157.363380
148 Kelly 144 848.0 9 154.600000
182 Meredith 134 562.0 9 141.549254
58 Creed 130 408.0 8 135.021538
67 Darryl 111 1234.0 9 131.010811

Something that stands out in the listing above is the number of seasons: while most characters selected by the score took part in all nine seasons, three did not.

Those characters are Michael, Andy, and Creed. I decided to research why they have fewer seasons than the other main characters, and this led to some interesting information.

  • Andy only entered the show in the third season.

  • Creed, according to theoffice.fandom.com, appeared in the background of the first season but didn't have any dialogs.

  • Michael, who achieved the highest score even with one season fewer, left the show late in the seventh season and returned only for the last episode, so he didn't take part in the eighth season at all.

Episodes, Dialogs and Seasons

To test our score we can compare the distributions of our selected variables.

In the chart below the values are displayed as:

  • X-axis = Number of episodes;
  • Y-axis = Number of dialogs;
  • Size = Number of seasons (more seasons = bigger markers);

The chart compares the main characters (red) selected by the score with all the other characters (blue).

We can see that the score works properly for selecting the characters with the most episodes and seasons. All the characters on the right side of the chart, with a high number of episodes, were selected by our algorithm.

Total and Average Dialogs

We can also see how the score handles the average number of dialogs. In the chart below we have:

  • X-axis = Average Dialogs;
  • Y-axis = Total Dialogs;
  • Size = Number of Seasons;

The chart shows that the number of seasons did well at separating the characters: the score selects not just the characters with a high average number of dialogs per episode, but the ones who also took part throughout the whole show.

Conclusion

The main characters are:

[Image: the main characters selected by the score]

*The above list is for visual display of the characters and is not sorted.

Words per Dialog

An interesting characteristic we can analyze is the number of words and sentences a character says. The idea is that a character with high averages is either too subjective or has a lot to say.

One hypothesis is that the character has a lot of information, or very complex information, to communicate, and therefore has a lot to say. The alternative is that the character doesn't have much to communicate but uses too many words to do it, which would mean the character is being subjective.

Facts

Mean

In the chart below, the blue bars represent the means and the black lines their standard deviations. The problem in this case is that the standard deviations are very large compared with the means. This means our data has extreme outliers, so the averages are not a good indication of who talks more or less. They only give us a rough idea.

Median

Since the means don't give us the full picture, we can analyse the medians for those characters.
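
To make the effect of outliers concrete, here is a small sketch with made-up word counts (hypothetical values, not taken from the dataset): mostly short lines plus one long monologue. A single extreme value inflates both the mean and the standard deviation, while the median stays close to a typical line.

```python
from statistics import mean, median, stdev

# Hypothetical word counts for one character: short lines and one monologue.
words_per_dialog = [3, 4, 5, 6, 8, 9, 120]

avg = mean(words_per_dialog)      # pulled upwards by the monologue
spread = stdev(words_per_dialog)  # sample standard deviation
mid = median(words_per_dialog)    # robust to the single extreme value

print(f"mean={avg:.1f} std={spread:.1f} median={mid}")
```

When the standard deviation exceeds the mean, as it does here and in the per-character results, the median is usually the safer summary of how much a character typically says.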

Interpretation

The medians complement our understanding. Michael and Andy still have the highest numbers, Dwight and Kelly swap the third and fourth places, most characters share the same median of 6, and Kevin apparently says fewer words than all the others.

This reinforces the importance of Michael and Dwight as leading characters, but why does Andy, and not Dwight, have the second-highest average and median?

Andy's high number of dialogs and words suggests that, even though he entered the show later, he was very important to the plot.


Some other insights this data supports are:

Talking a lot is actually one of Kelly's personality traits, and it is commented on and joked about throughout the series.

Of all the main characters, Kevin has the lowest median number of words per dialog, but this doesn't mean he's extremely objective or concise in his conversations.

Conclusion

  • Michael is 'The' main character of the show;
  • Andy is very important from the middle to the end of the show;
  • Kelly is known for talking too much;
  • Kevin is definitely not the brightest character;

Sentiment analysis

There are plenty of different methods for sentiment analysis. Among the most evident are pre-trained machine learning models: ready-to-use algorithms that can classify texts as positive, negative, or neutral.
The problem with all the pre-trained models researched is that they are trained either on social media data or on product review and rating data. Those forms of communication differ a lot from the data we're analysing, so it wouldn't be appropriate to use them. Besides the pre-trained models, there are a few open-source algorithms we could use to train our own models. The biggest problem with this approach is that we would have to label our data to train a model, which would require too much time and resources.

The best solution found to satisfy the needs of the project was VADER
(Valence Aware Dictionary and sEntiment Reasoner): https://github.com/cjhutto/vaderSentiment

VADER uses a dictionary to assign scores to the words, while considering their position within the text and the punctuation, and scores the document with the proportion of each sentiment it contains. Those sentiments are named negative, positive, and neutral.

In addition to the proportions for each sentiment, VADER calculates a compound score. The compound is the sum of the valence scores of the words in the text, adjusted by VADER's rules and normalized to range from -1 (completely negative) to 1 (completely positive).

Some of the semantic contexts considered by VADER are:

    Conjunctions        E.g.: 'I like your X, but your Y is very bad';
    Negation Flips      E.g.: 'This is not really the greatest';
    Degrees         E.g.: 'This is good' vs 'This is extremely good';
    Capitalization      E.g.: 'this is GREAT' vs 'this is great';
    Punctuation         E.g.: 'this is great!!!' vs 'this is great'; 

Facts

Visualize Polarities

The visualization of the results displays the characters of the show and the average positive and negative sentiment for each of them.

For a fair comparison, both values are plotted on the same scale, where:

  • Range = [ min( positive, negative ), max( positive, negative ) ]

So the range of 0.04 to 0.23 is applied to both the x and y axes.

Interpretation

We can see that most characters behave similarly in terms of the polarity of their dialogs: for the vast majority, the values concentrate in high positive and low negative. We can also see some outliers away from the group.

Outliers

As mentioned before, most of the characters have a high positive score of around 0.14 to 0.20, with a low negative score of 0.06 to 0.08.
But we can note some characters with higher negative scores and also a character with a lower positive score.

Stanley is the most distant from the other characters: he has a relatively low positive score, but his negative score isn't so high either.

This means his dialogs are mostly neutral, almost like he doesn't want to get involved.

Out[20]:
chars dialogs avg_words positive neutral negative compound unique_s unique_ep score
182 Meredith 562.0 7.798932 0.164235 0.741536 0.094222 0.049518 9 134 141.549254
12 Angela 1564.0 8.598465 0.153496 0.739256 0.106611 0.053237 9 173 189.272832
251 Stanley 686.0 8.651603 0.110771 0.801340 0.087892 0.026518 9 168 175.350000

Conclusion

Most of the show revolves around positive dialogs.
Some outliers have a lower share of positive dialogs and some have a higher share of negative dialogs, but even they have more positive than negative sentiment overall.

Relationships

The file 'conversations.json' contains one record for every scene in the show. Each record contains the names of the characters that had some dialog in the scene and the number of dialogs each of them had.

These conversations will be used to calculate a score for the relations between the characters.

First 5 rows:
Out[21]:
[{'Michael': 3, 'Jim': 2},
 {'Michael': 5, 'Pam': 4},
 {'Michael': 6, 'Jim': 3, 'Dwight': 2},
 {'Michael': 11, 'Pam': 2},
 {'Phyllis': 1, 'Stanley': 1}]

Facts


Scores

In order to compare the relationships between the characters, the following formula was developed:

∑ min(nx, ny) / max(nx, ny)

Where:

nx = number of dialogs character x had in a conversation;
ny = number of dialogs character y had in a conversation;


This score is based on the idea that a perfectly balanced conversation has the same number of dialogs from both agents.

E.g.: A conversation with three characters x, y and z,
where x said 5 dialogs, y said 5 dialogs, and z said 1 dialog, results in a score between x and y of 1, while the score between x and z is 0.2.


The scores are then summed over all conversations for the same pair so they can be compared. It's important to note that this results in generally higher scores for characters that communicate a lot and lower scores for characters that don't.
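
A minimal sketch of this calculation, assuming each conversation is a dictionary of character name to dialog count, as in the records shown from 'conversations.json'. The function name and the use of alphabetically sorted name pairs as keys are choices made for this example only.

```python
from collections import defaultdict
from itertools import combinations


def relationship_scores(conversations):
    """Sum min(nx, ny) / max(nx, ny) over every scene shared by a pair."""
    scores = defaultdict(float)
    for scene in conversations:
        # Every pair of characters that spoke in the scene gets a contribution.
        for x, y in combinations(sorted(scene), 2):
            nx, ny = scene[x], scene[y]  # dialog counts, assumed to be positive
            scores[(x, y)] += min(nx, ny) / max(nx, ny)
    return dict(scores)


# The worked example from the text: x and y said 5 dialogs each, z said 1.
example = relationship_scores([{"x": 5, "y": 5, "z": 1}])

# The first five records shown from 'conversations.json'.
conversations = [
    {"Michael": 3, "Jim": 2},
    {"Michael": 5, "Pam": 4},
    {"Michael": 6, "Jim": 3, "Dwight": 2},
    {"Michael": 11, "Pam": 2},
    {"Phyllis": 1, "Stanley": 1},
]
scores = relationship_scores(conversations)
print(example)
print(scores[("Jim", "Michael")])
```

Because the score is a sum rather than an average, a pair that shares many scenes accumulates a high value even when each individual scene is only moderately balanced.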

After calculating the relationship scores for every character of the show we have the following data:

Out[24]:
Kevin Phyllis Meredith Angela Stanley Oscar Pam Jim Darryl Creed Ryan Howard Dwight Kelly Andy Michael
names
Kevin 0.000000 109.195238 68.806818 154.564286 65.969048 170.094913 147.983586 142.731019 66.178510 49.209524 44.191450 123.556432 58.044444 125.062612 108.404618
Phyllis 109.195238 0.000000 58.662338 103.914286 133.944048 107.298810 130.937843 118.793685 45.466667 41.485714 41.692857 142.265707 55.544444 105.159174 82.771664
Meredith 68.806818 58.662338 0.000000 59.121861 44.815909 69.755195 76.134199 53.527924 25.266667 32.631818 22.500000 69.378304 39.252020 59.419120 51.747815
Angela 154.564286 103.914286 59.121861 0.000000 56.385714 159.026190 124.944444 70.692491 22.976190 32.319048 25.576190 196.329949 60.100000 84.234423 78.417826
Stanley 65.969048 133.944048 44.815909 56.385714 0.000000 71.201190 77.652092 84.074060 22.366667 34.250000 38.444048 93.269719 31.594444 76.107937 77.463557
Oscar 170.094913 107.298810 69.755195 159.026190 71.201190 0.000000 127.759679 107.622387 52.383333 48.391667 46.430357 117.605836 46.766703 96.470033 99.879648
Pam 147.983586 130.937843 76.134199 124.944444 77.652092 127.759679 0.000000 587.992857 52.899206 42.349639 81.944931 238.176138 75.771176 130.851199 301.972284
Jim 142.731019 118.793685 53.527924 70.692491 84.074060 107.622387 587.992857 0.000000 58.863492 46.019208 87.016138 452.530350 63.588468 183.466818 293.039036
Darryl 66.178510 45.466667 25.266667 22.976190 22.366667 52.383333 52.899206 58.863492 0.000000 11.342857 20.764286 60.986722 24.885714 85.220202 69.502232
Creed 49.209524 41.485714 32.631818 32.319048 34.250000 48.391667 42.349639 46.019208 11.342857 0.000000 20.602381 49.598413 21.111111 32.715666 40.354334
Ryan Howard 44.191450 41.692857 22.500000 25.576190 38.444048 46.430357 81.944931 87.016138 20.764286 20.602381 0.000000 89.840079 79.769444 40.667532 124.703359
Dwight 123.556432 142.265707 69.378304 196.329949 93.269719 117.605836 238.176138 452.530350 60.986722 49.598413 89.840079 0.000000 65.374206 206.445069 456.741376
Kelly 58.044444 55.544444 39.252020 60.100000 31.594444 46.766703 75.771176 63.588468 24.885714 21.111111 79.769444 65.374206 0.000000 53.457479 57.788593
Andy 125.062612 105.159174 59.419120 84.234423 76.107937 96.470033 130.851199 183.466818 85.220202 32.715666 40.667532 206.445069 53.457479 0.000000 113.068009
Michael 108.404618 82.771664 51.747815 78.417826 77.463557 99.879648 301.972284 293.039036 69.502232 40.354334 124.703359 456.741376 57.788593 113.068009 0.000000

Interpretation


Visualize Scores

At this point we'll start comparing the relationships and describing them as 'strong' or 'weak', depending on their scores. It's important to note that a strong relationship in this context says nothing about the sentiment between the characters, so it won't necessarily be a positive relation.

In this context, a strong relationship means the characters communicate a lot.

The scores are already very meaningful by themselves: we can tell that Pam and Jim have the strongest relationship of all.

We can also notice that Michael, the main character of the show, has overall higher scores with everybody than 'lower-ranked' main characters such as Meredith, Creed, or Darryl.

This makes sense: Michael communicates more regularly with everybody in the show, so he probably has a stronger relationship with most characters.

Normalize

To extract even more information about the relationships we can normalize the scores. In this case we'll do so by standardizing the values, that is, calculating their z-scores. This lets us see how many standard deviations away from the mean each relation is.

Simply put, we want to see how extreme those relationships are for each character.
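
The standardization can be sketched as follows for a single character. The values are a few of Jim's relationship scores from the table above, rounded to two decimals. Whether the notebook uses the population or the sample standard deviation is not stated, so the population form used here is an assumption.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a mapping of name -> score into z-scores."""
    mu = mean(values.values())
    sigma = pstdev(values.values())  # population standard deviation (assumed)
    return {name: (v - mu) / sigma for name, v in values.items()}


# A few of Jim's relationship scores from the table above (rounded).
jim = {"Pam": 587.99, "Dwight": 452.53, "Michael": 293.04, "Andy": 183.47, "Creed": 46.02}
jim_z = z_scores(jim)
for name, z in jim_z.items():
    print(f"{name}: {z:+.2f}")
```

A positive z-score marks a relationship that is stronger than the character's average one, and a negative z-score marks a weaker one.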

Visualize P-Values

One way of improving this visualization is to show the actual p-values, which represent how likely it is to find those values in the distribution.

In this case, we'll look for relationships with a p-value lower than 0.05, which corresponds to a 95% confidence level that those relationships differ in a statistically significant way from the average relationship of the analysed characters.
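
One common way to turn a z-score into a p-value is through the standard normal distribution. The sketch below assumes normally distributed scores and a two-sided test. The notebook does not state which variant it uses, so treat both choices as assumptions.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a value at least this extreme under a standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


for z in (0.5, 1.96, 3.0):
    p = two_sided_p_value(z)
    print(f"z={z:.2f} p={p:.4f} significant={p < 0.05}")
```

With a two-sided test, the 0.05 threshold corresponds to a z-score of roughly 1.96 in either direction.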

Strongest Relationships

With 95% confidence, the relationships listed below had a higher conversation score than the average relationship.

Michael -> Dwight 

Dwight -> Michael
Dwight -> Jim 

Jim -> Dwight 
Jim -> Pam 

Pam -> Jim 

Angela -> Dwight 

Andy -> Dwight

Darryl -> Andy

Ryan -> Michael

Stanley -> Phyllis

Visualize the strongest relationships in a network chart

[Figure: network chart titled 'Relationships']

Conclusion

  • Jim and Pam have the strongest relationship;
  • Dwight has the largest number of strong relationships;
  • None of the main characters has a statistically relevant weak relationship;

Words Frequency

The frequency of words and terms can give us an interesting perspective on how the characters communicate and what the show is about.

Facts


Most Frequent Terms

To start, we can visualize the show's most frequent words in a word cloud. To do that we use a bag-of-words algorithm that selects and displays the words and terms with the highest frequency.
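
The counting behind a bag-of-words view can be sketched with `collections.Counter`. The strings below are `clean_txt` values from the first rows of the main dataset. Splitting on whitespace is an assumption about how the cleaned text is tokenized.

```python
from collections import Counter

# 'clean_txt' values from the first rows of the main dataset.
clean_dialogs = [
    "right jim quarterlies look good things library",
    "oh told couldnt close",
    "actually called yeah",
    "right well let show done",
]

# Word order is ignored; only the number of occurrences is kept.
word_counts = Counter(word for dialog in clean_dialogs for word in dialog.split())
print(word_counts.most_common(3))
```

A word cloud then only needs to map each count to a font size.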

In the word cloud we can see that many of the words relate to people: names and pronouns are very common in the characters' daily communication. We can also see that many of those words have little to no meaning by themselves.

To improve on that, we can check which terms distinguish the characters. In other words, we'll remove words that are common to all characters and focus on the words that are specific to each of the main characters.

TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF) is a method that weighs how many times a term appears in a document against how many documents the term appears in.

TF-IDF = TF * IDF
TF = Frequency(term)
IDF = log ( Number of documents / Number of documents containing the term )

After calculating the TF-IDF scores, we take the difference between each character's score and the mean score for all characters. This shows how far above or below the average each word was used by the character.
The result is then sorted to get the words most above the average for each character.
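
The two steps described above can be sketched in plain Python. This follows the formula exactly as written, with raw counts for TF and an unsmoothed logarithm for IDF. Library implementations usually add smoothing and normalization, so their numbers will differ. The three one-line "documents" are hypothetical and stand in for all the dialogs of each character.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return {doc: {term: tf * idf}} with TF = raw count, IDF = log(N / df)."""
    counts = {doc: Counter(text.split()) for doc, text in documents.items()}
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts.values() for term in c)
    return {
        doc: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in c.items()}
        for doc, c in counts.items()
    }


def above_average(scores, doc):
    """Sort terms by (score in doc) - (mean score over all documents)."""
    vocab = {t for s in scores.values() for t in s}
    n_docs = len(scores)
    diff = {
        t: scores[doc].get(t, 0.0) - sum(s.get(t, 0.0) for s in scores.values()) / n_docs
        for t in vocab
    }
    return sorted(diff.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical one-line "documents", one per character.
docs = {
    "Pam": "art mural paint office",
    "Dwight": "schrute beets office office",
    "Michael": "everybody office jan",
}
scores = tf_idf(docs)
print(above_average(scores, "Pam")[:3])
```

A word used by every character, such as "office" in this toy input, gets an IDF of zero and drops out, which is the effect the method is used for here.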

*test*
How many times the words business appears: 2491
last 5 rows:
Out[32]:
word Michael Jim Pam Dwight Phyllis Stanley Oscar Angela Kevin Ryan Howard Kelly Meredith Darryl Creed Andy sum_score mean_score
zoppity 0.001527 0.0 0.0 0.00000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.001527 0.000204
zoran 0.000000 0.0 0.0 0.00169 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.001690 0.000225
zuckerberg 0.000000 0.0 0.0 0.00169 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.001690 0.000225
zuckerberged 0.000000 0.0 0.0 0.00000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.008459 0.0 0.0 0.008459 0.001128
zwarte 0.000000 0.0 0.0 0.00000 0.0 0.0 0.009056 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.009056 0.001207

The 10 most distinct words by character

Out[33]:
Kevin Phyllis Meredith Angela Stanley Oscar Pam Jim Darryl Creed Ryan Howard Dwight Kelly Andy Michael
0 oscar bob wait sprinkles florida gay um awimowheh mike bratton kelly schrute ryan tuna everybody
1 warning clients van senator scarn angela cece alright man debbie wuphf mose god bernard jan
2 shred ho minute dwight mind kevin mural nope val creed presentation jim fashion erin alright
3 awesome bobs jakey cat toaster senator paint wow warehouse boss thailand ha blah cornell holly
4 phillip personality aint kevin damn dollars gosh rundown yall betty pesto assistant ruff bum scott
5 stacy luke lice phillip heh blake chore definitely justine brown um hay bridesmaid flag beep
6 ooh lettuce manuel senators pretzel angelas jims assistant truck swiss silicon sensei cuz andrew sort
7 cress fanned meredith oscar hudson le brian warmer tacos cartwheel treated manager dots treble somebody
8 dunhduhnadah afghani alcoholic contract wallpaper gerald art uh beanie persons powerpoint regional ravi fail david
9 fluke birdhouse vagina ugh lost spend dad beesly jada devon drake idiot obsessed rob carol

Interpretation

We can see some patterns in the distinguishing terms. For example, every character has some person they mention distinctively, with a high frequency and more often than the other characters do.

This means a great deal of the show is spent talking about people and personal relationships.

A better way to use this table would be to analyse the characters individually and research the combinations of 'Character name' + 'Term'.

This can lead us to a more detailed understanding of the characters. Some are more explicit, like Pam: her distinguishing words include 'Mural', 'Paint', and 'Art', which are clear indications of her interest in the arts.

With other characters we can find more subjective information. Meredith's list has words such as 'Lice', 'Alcoholic', and 'Vagina'. Those words may be considered too intimate or even uncomfortable, and this is the essence of her character.

Conclusion

The Office is a show where the characters mostly talk about other people and their relationships.

We can also conclude that the TF-IDF scores provide a quick and easy starting point for researching the characters' personalities in more detail.

Individual Characters

One of the many ways of breaking down all this data is to analyse the characters individually. From this point on, the previously discussed methods are adapted for a single character.

Besides the data seen previously, in this section we'll also explore the ratings.

The options are:
['Kevin' 'Phyllis' 'Meredith' 'Angela' 'Stanley' 'Oscar' 'Pam' 'Jim'
 'Darryl' 'Creed' 'Ryan Howard' 'Dwight' 'Kelly' 'Andy' 'Michael']

Selected character is: Michael

Sentiment Polarity

The polarity scores for each dialog were generated by VADER. Please consult the previous section 'Sentiment analysis' or the Data Cleaning and Preparation notebook for more information about this method and its implementation.

Normalize Polarity

The sentiment analysis shows a high share of neutral interactions and low shares of negative and positive ones for most characters. To better visualize the small differences between those scores we can normalize them.

Out[36]:
POS NEU NEG
chars
Kevin 0.667510 -0.672402 -0.101005
Phyllis -0.220853 0.468629 -0.360813
Meredith -0.011946 -0.833780 1.466491
Angela -0.478777 -0.944125 2.490819
Stanley -2.336108 2.060757 0.943092
Oscar -0.510720 1.249539 -1.143053
Pam 1.486826 -1.186604 -0.780542
Jim 1.368037 -0.986721 -0.896094
Darryl -0.131188 0.396196 -0.406808
Creed -0.945611 1.016878 -0.120823
Ryan Howard 0.351370 0.270397 -1.109912
Dwight -0.403143 0.172421 0.467807
Kelly -0.632555 0.631062 0.145183
Andy 0.535616 -0.315455 -0.481661
Michael 1.261541 -1.326792 -0.112680

Radar Charts

To visualize the three normalized variables (positive, negative, and neutral), we'll use radar charts. With the normalized data we can more easily compare the extent of each polarity in the selected character.

Polarity Distribution

We can also visualize the distribution of the polarity through the episodes. This should allow us to see changes in the character's behavior and outliers that may be worth a closer look.

[Figure: 'Michael Sentiment by Episode']

Words and Terms

In this section, we'll repeat the methods used in '5 - Words Frequency', but this time for a single character. We'll also add a method from spaCy that can help us identify the entities mentioned in the dialogs.

Distinguish Terms

Here we can analyze the most distinguishable terms for a specific character. The more distinguishable the term, the bigger the font size.

EVERYBODY

JAN

ALRIGHT

HOLLY

SCOTT

BEEP

SORT

SOMEBODY

DAVID

CAROL

Most Frequent Words

Here we build a word cloud with the most frequent terms the character said. The cleaned version of the text is used for the visualization.

Entities

Here we'll visualize the most commonly mentioned entities. More specifically, we'll filter the people, organizations, products, locations, and events mentioned in the dialogs and then count them to find the ones most mentioned by the selected character.

Out[42]:
count
name type
Dwight PERSON 275
Jim PERSON 214
Pam PERSON 148
Ryan PERSON 134
Stanley PERSON 122
Phyllis PERSON 98
Oscar PERSON 95
Kevin PERSON 91
Michael Scott PERSON 71
Jan PERSON 71
Andy PERSON 64
Michael PERSON 62
Scranton ORG 60
David PERSON 46
Holly PERSON 46

Regarding Michael, the word and term frequencies have something in common: they're all strongly related to people.

In the TF-IDF scores, Michael's 10 most distinguishable words include 2 pronouns ('Everybody' and 'Somebody') and 5 names. With the bag-of-words algorithm it's harder to see the patterns, since there are many meaningless words, but we can still see lots of names and pronouns related to people.

The strongest evidence of this is the list of entities most frequently mentioned by Michael: of the 15 entities displayed, only one is not a person, and this exception is actually the name of their city. This suggests that Michael is someone whose biggest interests are people and the community.

https://www.youtube.com/watch?v=vrPgsrfZWOU&feature=youtu.be&t=327

Ratings

Correlation

Here we can check the correlation (Pearson method) between the previously analysed measures and the actual ratings of the episodes.

We can also compare any given variable with the actual ratings, which helps us visualize how closely related those values are.
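
For reference, the Pearson coefficient can be computed directly from its definition. The per-episode numbers below are hypothetical and only illustrate the call. In the notebook this is presumably done with a library routine rather than by hand.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-episode values: total words by a character and the rating.
total_words = [900, 1200, 700, 1500, 1100]
ratings = [8.1, 8.6, 7.9, 9.0, 8.4]
print(round(pearson(total_words, ratings), 3))
```

The coefficient ranges from -1 to 1 and only captures linear association, so a value near zero does not rule out other kinds of relationship between a variable and the ratings.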

The options are:
['dialogs', 'mean_sent', 'mean_words', 'mean_positive', 'mean_negative', 'mean_neutral', 'mean_compound', 'total_sent', 'total_words']

Selected variable: total_words

To better visualize the patterns and movements of the selected variable and the ratings, both were interpolated to fit 50 data points. We lose information by doing so, but it's easier to spot trends and see the overall direction of the comparison with fewer data points.
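
A minimal sketch of such a resampling step, assuming plain linear interpolation over evenly spaced positions. The interpolation method used in the notebook is not stated, and the function name and input values are hypothetical.

```python
def resample_linear(values, n_points):
    """Linearly interpolate a sequence onto n_points evenly spaced positions."""
    if n_points < 2 or len(values) < 2:
        raise ValueError("need at least two input values and two output points")
    last = len(values) - 1
    resampled = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)  # fractional index into the input
        lo = min(int(pos), last - 1)     # left neighbour, clamped at the end
        frac = pos - lo
        resampled.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return resampled


# Hypothetical ratings for 8 episodes, resampled to 5 points.
ratings = [7.6, 8.3, 7.9, 8.1, 8.4, 7.8, 8.2, 8.9]
print(resample_linear(ratings, 5))

# A longer series reduced to a fixed length of 50 points.
smoothed = resample_linear(list(range(139)), 50)
print(len(smoothed))
```

Resampling both series to the same length makes them directly comparable on one chart, at the cost of smoothing away episode-level detail.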

Summary

The show has an overall positive sentiment in the interactions it displays. The main characters interact a lot with each other, and what they discuss most is the relationships among themselves.

Some characters have stronger relationships than others, like Jim and Pam, who play a romantic pair in the show and have the highest number of interactions of all the main characters.

According to the number of dialogs, episodes, seasons, and the size of the dialogs, Michael is by far the main character of the show. He's also the character with the strongest correlations between his participation and the episode ratings.

Dwight and Andy also had a very strong participation in the show. While Dwight's participation was relatively constant throughout the show, Andy seems to have participated a lot more, but over a shorter time range.

The characters also show some sort of group behavior, and we can identify those groups by their similarities in number of dialogs, episodes, and relationship scores.

Those groups also appear to mirror the positioning of the characters in the office. They can be identified as:

  • Michael and Dwight;
  • Jim and Pam;
  • Angela, Oscar, and Kevin;
  • Stanley and Phyllis;
  • Ryan and Kelly;

All of them worked close to each other, and in most groups they all worked in the same department.

We can conclude from this report that The Office is not so much about the work in an office environment as about the lives of the people who work there. It does so by exploring more personal aspects of the characters, such as their romantic lives, aspirations, families, friends, personal challenges, and discomforts.